Grade Clustering and Seriation of Words Based on Their Co-Occurrences

Authors

  • Emilia Jarochowska
  • Krzysztof Ciesielski
Abstract

We present the use of grade correspondence analysis (GCA) in text mining. A sample of words extracted from 20 newsgroups was arranged linearly according to the concordance between their co-occurrence distributions. The word co-occurrence matrix, obtained with the HAL (Hyperspace Analogue to Language) system and normalized to de-emphasize overly frequent terms, was reordered by the GCA algorithm as implemented in the GradeStat program. The aim of this reordering was to approach TP2 regularity of dependence, measured with Kendall's tau. Deviations from regularity were used to order words into series, cluster them, and visualize them with overrepresentation maps. Seriation yielded a contextual scale running from computer-related terms to politics- and religion-related terms; words that appear in varied contexts occupy intermediate positions on this scale. Word ordering based on the similarity of occurrence patterns can be used in thesaurus building and query expansion in information retrieval applications.
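The two quantities named in the abstract, a HAL-style co-occurrence matrix and Kendall's tau of a two-way table, can be illustrated with a minimal Python sketch. It is not the authors' pipeline and not the GradeStat implementation. The function names, the window size, the linear distance weighting, and the tie-uncorrected form of tau are all assumptions made for illustration, and the normalization step is omitted because the abstract does not specify it.

```python
from collections import defaultdict


def hal_cooccurrence(tokens, window=5):
    """HAL-style weighted co-occurrence counts (illustrative assumption).

    A word followed by another word at distance d (1 <= d <= window)
    adds the weight window - d + 1 to the ordered pair (word, follower).
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            counts[(word, tokens[i + d])] += window - d + 1
    return counts


def to_matrix(counts, vocab):
    """Arrange pair counts as a dense matrix following the order of vocab."""
    index = {w: k for k, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in vocab]
    for (a, b), weight in counts.items():
        matrix[index[a]][index[b]] += weight
    return matrix


def kendall_tau_table(table):
    """Kendall's tau of a two-way table, without tie correction.

    The table is scaled to sum to 1 and read as a joint distribution;
    the result is P(concordant) - P(discordant) for two independent
    draws. The value depends on the row and column order, which is
    what a reordering procedure can exploit.
    """
    total = sum(sum(row) for row in table)
    p = [[v / total for v in row] for row in table]
    rows, cols = len(p), len(p[0])
    concordant = discordant = 0.0
    for i in range(rows):
        for k in range(i + 1, rows):
            for j in range(cols):
                for l in range(cols):
                    if j < l:
                        concordant += p[i][j] * p[k][l]
                    elif j > l:
                        discordant += p[i][j] * p[k][l]
    return 2.0 * (concordant - discordant)


if __name__ == "__main__":
    tokens = "the server crashed and the server rebooted".split()
    vocab = sorted(set(tokens))
    matrix = to_matrix(hal_cooccurrence(tokens, window=2), vocab)
    print(vocab)
    print(kendall_tau_table(matrix))
```

Swapping two rows or columns of the matrix generally changes the value returned by `kendall_tau_table`, so an ordering can be scored by how strongly it arranges the table toward positive dependence. How GCA actually searches for such an ordering is not described in the abstract and is not reproduced here.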


Similar resources

A Probabilistic Topic Model Based on Local Word Relations in Overlapping Windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...


A Seriation Approach for Visualization-Driven Discovery of Co-Expression Patterns in Serial Analysis of Gene Expression (SAGE) Data

BACKGROUND Serial Analysis of Gene Expression (SAGE) is a DNA sequencing-based method for large-scale gene expression profiling that provides an alternative to microarray analysis. Most analyses of SAGE data aimed at identifying co-expressed genes have been accomplished using various versions of clustering approaches that often result in a number of false positives. PRINCIPAL FINDINGS Here we...


A New Document Embedding Method for News Classification

Abstract: Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into predefined categories. A great deal of news circulates on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...


Combining Syntactic Co-occurrences and Nearest Neighbours in Distributional Methods to Remedy Data Sparseness.

The task of automatically acquiring semantically related words has led people to study distributional similarity. The distributional hypothesis states that words that are similar share similar contexts. In this paper we present a technique that aims at improving the performance of a syntax-based distributional method by augmenting the original input of the system (syntactic co-occurrences) wit...


The Intellectual Structure of Knowledge in the Field of Distance Education Using the Co-Word analyses

Background: Co-word analysis is one of the content analysis methods used in scientometric studies and in mapping the scientific structure of various fields. The purpose of the present research is to map the structure of distance education using co-word analysis. Methods: The research method is content analysis using co-word analysis. The research population is 31607 documents indexed in the...



Publication date: 2006